Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


1110696, 1230237, and 1234567. The first two variants are single point substitutions. The

third position has two alternate alleles (G and T) that replaced the ref nucleotide (A). The

fourth record has no alternate allele (ALT is “.”), which in VCF marks a site called as matching the reference rather than a deletion of the T. In the

fifth position, there are two alternative alleles, the first is a deletion of two nucleotides (T

and C) and the second is an insertion of a single nucleotide T.

The QUAL column holds the Phred-scaled quality score of the call at each position. The FILTER column records the outcome of filtering: PASS when the record passed every filter, or the keywords of the filters it failed. These keywords can be used to filter the variants, as we will discuss later. The second row (position 17330) does not pass the quality threshold of a Phred quality score of 10.
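The QUAL and FILTER columns can be interpreted with a few lines of Python. This is a sketch under stated assumptions: the helper names are hypothetical, and the filter keyword `q10` in the usage note is an illustrative label for a quality-below-10 filter, not a value taken from the text.

```python
def parse_qual(qual_field: str):
    """Return QUAL as a float, or None when the value is missing ('.')."""
    return None if qual_field == "." else float(qual_field)


def failed_filters(filter_field: str) -> list:
    """Return the keywords of the filters a record failed.

    'PASS' means every filter passed; '.' means no filters were applied.
    Multiple failed filters are separated by semicolons.
    """
    if filter_field in ("PASS", "."):
        return []
    return filter_field.split(";")


def passes_filters(filter_field: str) -> bool:
    """True only when the FILTER column explicitly reports PASS."""
    return filter_field == "PASS"
```

With these helpers, `passes_filters("PASS")` is true, while `failed_filters("q10")` returns the single keyword `q10`.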

The INFO column includes position-level information for that data row and can be thought of as aggregate data that summarizes the sample-level information specified.
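An INFO value is a semicolon-separated list of `key=value` pairs and bare flags, so it maps naturally onto a dictionary. The sketch below assumes that layout; the function name is hypothetical, the example string in the usage note is illustrative, and values are kept as raw strings rather than converted to the types declared in the metadata.

```python
def parse_info(info_field: str) -> dict:
    """Split an INFO column into a dictionary.

    Keys with '=' map to their raw string value; bare flags map to True.
    """
    info = {}
    if info_field == ".":
        return info  # no INFO entries for this record
    for entry in info_field.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info
```

For instance, `parse_info("NS=3;DP=14;AF=0.5;DB;H2")` yields the three keyed values as strings and the two flags `DB` and `H2` as `True`.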

The FORMAT column specifies the sample-level fields to expect under each sample. Each row has the same format fields (GT, GQ, DP, and HQ) except for the last row, which does not have HQ. Each of these fields is described in the metadata section. GT (Genotype) gives the alleles of the sample, separated by / if unphased or | if phased; GQ is the Genotype Quality, a single integer; DP is the Read Depth, a single integer; and HQ is the Haplotype Quality, two integers separated by a comma.
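The pairing of FORMAT keys with a sample's colon-separated values can be sketched as follows. The function names are hypothetical and the sample string in the usage note is illustrative; a real parser would also convert GQ, DP, and HQ to integers according to the metadata.

```python
def parse_sample(format_field: str, sample_field: str) -> dict:
    """Pair each FORMAT key with the matching value in one sample column."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # zip stops at the shorter list, so omitted trailing fields are skipped.
    return dict(zip(keys, values))


def parse_genotype(gt: str):
    """Return (allele indices, phased flag) for a GT value such as '0|1'."""
    phased = "|" in gt
    separator = "|" if phased else "/"
    # '.' marks a missing allele call and is returned as None.
    alleles = [None if a == "." else int(a) for a in gt.split(separator)]
    return alleles, phased
```

As a usage example, `parse_sample("GT:GQ:DP:HQ", "0|0:48:1:51,51")` gives a dictionary whose `GT` entry, passed to `parse_genotype`, returns the allele indices `[0, 0]` with the phased flag set.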

This VCF file has three samples identified by their names (NA00001, NA00002, and

NA00003) in columns 10 through 12.
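Because the first nine columns of the header line are fixed (CHROM through FORMAT), the sample names can be recovered by taking everything from the tenth column onward. The following is a small sketch assuming a tab-delimited `#CHROM` header line; the function name is hypothetical.

```python
def sample_names(header_line: str) -> list:
    """Return the sample names from a '#CHROM ...' header line.

    Columns 1-9 are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT;
    sample columns start at column 10 (index 9).
    """
    columns = header_line.rstrip("\n").lstrip("#").split("\t")
    return columns[9:]


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002\tNA00003"
names = sample_names(header)
```

Here `names` holds the three sample identifiers in column order.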

Genetic variants discovered by researchers are submitted, usually in VCF files, to databases that archive the genetic differences together with related information. These databases collect, organize, and publicly document the evidence supporting links between genetic variants and diseases or conditions. The variants are usually submitted with their assertions, which are informed assessments of the association, or lack of association, between a disease or condition and a genetic variant based on the current state of knowledge. The variant databases include dbSNP (for human variants shorter than 50 base pairs), dbVar (for human variants longer than 50 base pairs), and the European Variation Archive (for variants of all species).

A variant submitted to a database is given a unique identifier that can be used to find that variant and its related information in the database because such identifiers are unambiguous,

FIGURE 4.1 VCF file showing metadata and data sections.